Running head: WEAKLY INFORMATIVE PRIOR FOR COVARIANCE MATRICES
Weakly Informative Prior for Point Estimation of Covariance Matrices in Hierarchical Models
Authors
Abstract
When fitting hierarchical regression models, maximum likelihood estimation has computational (and, for some users, philosophical) advantages compared with full Bayesian inference, but when the number of groups is small, estimates of the covariance matrix (Σ) of group-level varying coefficients are often degenerate. One can do better, even from a purely point-estimation perspective, by using a prior distribution or penalty function. In this paper, we use Bayes modal estimation to obtain positive definite covariance matrix estimates. We recommend a class of Wishart (not inverse-Wishart) priors for Σ with a default choice of hyperparameters: the degrees of freedom are set equal to the number of varying coefficients plus 2, and the scale matrix is the identity matrix multiplied by a value that is large relative to the scale of the problem. This prior is equivalent to independent gamma priors for the eigenvalues of Σ with shape parameter 1.5 and rate parameter close to 0. It is also equivalent to independent gamma priors for the variances with the same hyperparameters multiplied by a function of the correlation coefficients. With this default prior, the posterior mode for Σ is always strictly positive definite. Furthermore, the resulting uncertainty for the fixed coefficients is less underestimated than under classical maximum likelihood or restricted maximum likelihood. We also suggest an extension of our method that can be used when stronger prior information is available for some of the variances or correlations.

Hierarchical or mixed-effects regression models are increasingly popular in applied statistics and are Bayesian on two levels: a prior distribution is assigned to the varying coefficients, and the parameters of that prior distribution are themselves given a hyperprior.
The family of models can be written in general terms as follows. Data are in groups $j = 1, \ldots, J$. For each group $j$, there is a response vector $y_j$ and two data matrices, $X_j$ and $Z_j$, that have fixed and varying coefficients, respectively. The data model is $p(y_j \mid X_j \beta + Z_j b_j)$, where $\beta$ is the vector of fixed coefficients and $b_j$ is the vector of regression coefficients that varies by group. The vectors $b_j$ are modeled as independent draws from a prior distribution, $p(b_j)$, given some hyperparameters. We shall assume a normal model for the varying coefficients, so that $b_j \sim N(0, \Sigma)$. The model could also include a nonzero mean vector or a group-level regression structure for the hyperprior distribution, but these can be folded into the fixed coefficients in the data model without loss of generality.

There is a rich literature on full Bayesian inference for hierarchical regressions. There is also an empirical Bayes version in which the hyperparameters (in this case, Σ) are estimated via maximum likelihood and then inference for the coefficients is performed conditional on the estimated Σ. From the Bayesian perspective, the empirical Bayes approach is suboptimal, both because it avoids the use of any prior information on Σ and because it understates posterior uncertainty.

From a pragmatic perspective, however, we recognize that the point estimation approach has two advantages that give it great appeal to many users. First, existing software such as lme4 in R and various commands in Stata allow such models to be fit quickly and reliably for moderate-sized datasets, whereas MCMC software for full-Bayes inference is not yet so immediately practical. Second, the non-Bayesian motivation behind point estimation is attractive to practitioners who want the benefits of partial pooling and hierarchical modeling without needing to specify prior information or fully buy into the Bayesian paradigm.
The subject of the present article, as with its predecessor on varying-intercept models with constant coefficients (Chung, Rabe-Hesketh, Dorie, Gelman, & Liu, 2013), is the use of Bayesian ideas and methods to produce better inferences for hierarchical models via better point estimates of the hyperparameters. In that sense, this work falls into a long tradition of Bayesian tools used for practical non-Bayesian inferences (e.g., Agresti & Coull, 1998). Bayes modal estimation (or penalized likelihood) has also been used to obtain more stable estimates in item response theory (e.g., Swaminathan & Gifford, 1985; Mislevy, 1986; Tsutakawa & Lin, 1986) and to avoid boundary estimates (or logit parameters tending to ±∞) in log-linear models (Galindo-Garre, Vermunt, & Bergsma, 2004), logistic regression (Gelman, Jakulin, Pittau, & Su, 2008), and latent class analysis (Maris, 1999; Galindo-Garre & Vermunt, 2006). Such an approach has also been used to obtain non-degenerate covariance matrices in factor analysis (Martin & McDonald, 1975), finite mixtures of normal densities (Ciuperca, Ridolfi, & Idier, 2003; Vermunt & Magidson, 2005), and multivariate regression (Warton, 2008).

The key problem solved by our method is the tendency of maximum likelihood estimates of Σ to be degenerate, that is, on the border of positive-definiteness, which corresponds to zero variance or perfect correlation among some linear combinations of the parameters. When the maximum likelihood estimate of a hierarchical covariance matrix is degenerate, this often arises from a likelihood that is nearly flat in the relevant dimension and just happens to have a maximum at the boundary. Our solution is a class of weakly informative prior densities for Σ that go to zero on the boundary as Σ becomes degenerate, thus ensuring that the posterior mode (i.e., the maximum penalized likelihood estimate) is always nondegenerate.
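To make the boundary behaviour concrete, the following sketch evaluates the log-kernel of the default prior described in the abstract: a Wishart density with d + 2 degrees of freedom and scale matrix c·I. The function names, the helper `cov2`, and the value c = 1e4 are illustrative assumptions, not taken from the paper. Under these assumptions the Wishart kernel reduces to 0.5·log|Σ| − tr(Σ)/(2c), which is the sum of gamma(1.5, 1/(2c)) log-kernels over the eigenvalues of Σ, and it decreases without bound as a correlation approaches 1.

```python
import numpy as np

def wishart_log_kernel(sigma, c=1e4):
    """Log-density of Wishart(df=d+2, scale=c*I) at sigma, up to an additive constant.

    c is a hypothetical 'large' scale multiplier chosen for illustration.
    """
    d = sigma.shape[0]
    df = d + 2
    sign, logdet = np.linalg.slogdet(sigma)
    if sign <= 0:
        # sigma is on (or outside) the boundary of positive definiteness
        return -np.inf
    # Wishart kernel: |sigma|^((df - d - 1)/2) * exp(-tr(scale^{-1} sigma) / 2)
    return 0.5 * (df - d - 1) * logdet - 0.5 * np.trace(sigma) / c

def gamma_eigen_log_kernel(sigma, c=1e4):
    """Sum of gamma(shape=1.5, rate=1/(2c)) log-kernels over the eigenvalues of sigma."""
    eig = np.linalg.eigvalsh(sigma)
    rate = 1.0 / (2.0 * c)
    return float(np.sum(0.5 * np.log(eig) - rate * eig))

def cov2(sd1, sd2, rho):
    """Build a 2x2 covariance matrix from two standard deviations and a correlation."""
    off = rho * sd1 * sd2
    return np.array([[sd1 ** 2, off], [off, sd2 ** 2]])

# The two kernels agree for a positive definite matrix.
sigma = cov2(1.0, 2.0, 0.3)
print(wishart_log_kernel(sigma), gamma_eigen_log_kernel(sigma))

# The log-prior falls toward -inf as the correlation approaches 1.
for rho in (0.9, 0.99, 0.999, 1.0):
    print(rho, wishart_log_kernel(cov2(1.0, 1.0, rho)))
```

Because the penalty is added to the log-likelihood, any maximizer of the penalized objective must stay away from matrices where this term is −∞, which is the mechanism the text describes.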
We recommend a class of Wishart priors with a default choice of hyperparameters: the degrees of freedom are set to the dimension of $b_j$ plus two, and the scale matrix is the identity matrix multiplied by a large enough number. This prior can be expressed as a product of gamma(1.5, θ) priors on the eigenvalues of Σ, or as a product of gamma(1.5, θ) priors on the variances of the varying effects, with rate parameter θ → 0, multiplied by a function of the correlations (a beta prior in the two-dimensional case). In the varying-intercept model (Chung, Rabe-Hesketh, Dorie, et al., 2013) and random-effects meta-analysis model (Chung, Rabe-Hesketh, & Choi, 2013), the gamma(1.5, θ) prior successfully avoids boundary estimates while producing estimates that are consistent with the data. We show that this is also true for the default Wishart prior proposed in this paper for general varying coefficient models. In a simulation study and an education example presented below, the default Wishart prior always gives nondegenerate estimates of Σ (in particular, non-perfect correlation coefficients) without decreasing the log-likelihood substantially. Estimators of the standard deviations and of the correlation between random effects based on the Wishart prior have better statistical properties than those based on (restricted) maximum likelihood.

When prior information is available for specific standard deviations or correlations, additional penalty functions may be included. Specifically, if the prior most plausible value for a standard deviation or correlation parameter is σ or ρ respectively, then we propose multiplying the Wishart prior by the gamma(2, 2/σ) or N(ρ, 0.25²) densities. This assigns more prior probability around the preferred values while exploiting the property of the Wishart prior that it ensures that the estimates remain positive definite.

The outline of the paper is as follows.
First, we illustrate the boundary estimation problems encountered in maximum likelihood estimation of hierarchical variance and covariance parameters. Then we introduce the default Wishart prior for Σ and investigate its properties. Next, additional penalty functions are proposed that incorporate further prior knowledge for some of the parameters. Finally, our method is applied to an example from education research and simulated data.

Boundary estimation problem

Consider the varying-coefficients model,

$$y_{ij} = x_{ij}^T \beta + z_{ij}^T b_j + \epsilon_{ij}, \quad i = 1, \ldots, n_j, \quad j = 1, \ldots, J, \tag{1}$$

where $y_{ij}$ is the response variable for unit $i$ in group $j$, $x_{ij}$ is a $p$-dimensional covariate vector with constant coefficients $\beta$, $z_{ij}$ is a $d$-dimensional data vector with varying coefficients $b_j \sim N(0, \Sigma)$, and $\epsilon_{ij} \sim N(0, \sigma_\epsilon^2)$ is a residual for each observation. We further assume that $b_j$ and $\epsilon_{ij}$ are independent.

Non-Bayesian point estimation

For each $j$, $y_j = (y_{1j}, \ldots, y_{n_j j}) \sim N(X_j \beta, V_j)$, where $X_j$ is an $n_j \times p$ matrix with $x_{ij}^T$ in the $i$th row, $V_j = Z_j \Sigma Z_j^T + \sigma_\epsilon^2 I$, and $Z_j$ is an $n_j \times d$ matrix with $z_{ij}^T$ in the $i$th row. The log-likelihood function is $\log p(y \mid \beta, \Sigma, \sigma_\epsilon) = -\frac{1}{2} \cdots$
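As a rough illustration of model (1) and the marginal distribution of each group's response vector, the sketch below simulates grouped data and evaluates the marginal log-likelihood. The log-likelihood in the code is the standard multivariate normal log-density implied by a response vector with mean Xβ and covariance ZΣZᵀ + σ²I; it is written out here as an assumption rather than copied from the paper. The covariate design (an intercept plus standard normal covariates, with Z taken as the first d columns of X), the function names, and all parameter values are likewise hypothetical.

```python
import numpy as np

def simulate_group(rng, beta, Sigma, sigma_eps, n_j):
    """Draw (y_j, X_j, Z_j) for one group from the varying-coefficients model."""
    p, d = len(beta), Sigma.shape[0]
    # Assumed design: intercept plus p - 1 standard normal covariates.
    X = np.column_stack([np.ones(n_j), rng.normal(size=(n_j, p - 1))])
    Z = X[:, :d]  # assumed: the first d coefficients vary by group
    b = rng.multivariate_normal(np.zeros(d), Sigma)   # b_j ~ N(0, Sigma)
    eps = rng.normal(scale=sigma_eps, size=n_j)       # eps_ij ~ N(0, sigma_eps^2)
    y = X @ beta + Z @ b + eps
    return y, X, Z

def marginal_loglik(groups, beta, Sigma, sigma_eps):
    """Sum over groups of the N(X_j beta, V_j) log-density of y_j."""
    total = 0.0
    for y, X, Z in groups:
        n_j = len(y)
        V = Z @ Sigma @ Z.T + sigma_eps ** 2 * np.eye(n_j)
        resid = y - X @ beta
        _, logdet = np.linalg.slogdet(V)
        quad = resid @ np.linalg.solve(V, resid)
        total += -0.5 * (n_j * np.log(2.0 * np.pi) + logdet + quad)
    return total

rng = np.random.default_rng(0)
beta = np.array([1.0, 0.5])
Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])
groups = [simulate_group(rng, beta, Sigma, 1.0, n_j=10) for _ in range(5)]
print(marginal_loglik(groups, beta, Sigma, 1.0))
```

With only a handful of groups, as in this toy setup, maximizing such a function over Σ is the setting in which the boundary estimates discussed in this section tend to arise.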
Similar resources
Weakly Informative Prior for Point Estimation of Covariance Matrices in Hierarchical Models
When fitting hierarchical regression models, maximum likelihood (ML) estimation has computational (and, for some users, philosophical) advantages compared to full Bayesian inference, but when the number of groups is small, estimates of the covariance matrix (S) of group-level varying coefficients are often degenerate. One can do better, even from a purely point estimation perspective, by using ...
Bayesian Estimates for Vector-Autoregressive Models
This paper examines frequentist risks of Bayesian estimates of VAR regression coefficient and error covariance matrices under competing loss functions, under a variety of non-informative priors, and in the normal and Student-t models. Simulation results show that for the regression coefficient matrix an asymmetric LINEX estimator does better overall than the posterior mean. For the error covari...
Prior distributions for variance parameters in hierarchical models
Various noninformative prior distributions have been suggested for scale parameters in hierarchical models. We construct a new folded-noncentral-t family of conditionally conjugate priors for hierarchical standard deviation parameters, and then consider noninformative and weakly informative priors in this family. We use an example to illustrate serious problems with the inverse-gamma family of ...
Covariance Matrices and Skewness: Modeling and Applications in Finance
(Bayesian correlation estimation, shadow prior, portfolio selection with higher moments) Covariance Matrices and Skewness: Modeling and Applications in Finance, by Merrill Windous Liechty, Institute of Statistics and Decision Sciences, Duke University
Avoiding Boundary Estimates in Linear Mixed Models Through Weakly Informative Priors
Variance parameters in mixed or multilevel models can be difficult to estimate, especially when the number of groups is small. Here we address the problem that the group-level variance estimate is often on the boundary. We propose a maximum penalized likelihood approach which is equivalent to estimating the variance by its marginal posterior mode, given a weakly informative prior distribution. ...
Publication date: 2014